31.5 Text Mining
One consequence of the apparent reluctance of experimenters in the biological sci-
ences to assign numbers to the phenomena they investigate is that the experimental
literature is very wordy and hence voluminous. Indeed, the literature of biology
(the “bibliome”)—especially research papers published in journals—has become so
vast that even with the aid of review articles that summarize many results within
a few pages it is impossible for an individual to keep abreast of it, other than in
some very specialized part. Text mining in the first instance merely seeks to automate
the search process, focusing above all on the facts uncovered by researchers.
Keyword searches, which nowadays can be extended to cover the entire text of a
research paper or a book, are straightforward—an instance of string matching (pat-
tern recognition)—but typically the results of such searches are themselves too vast
to be humanly processed, and more sophisticated algorithms are required.
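By way of illustration, the following is a minimal sketch of such a keyword search as plain string matching over a corpus of plain-text abstracts. The directory name and the query terms are hypothetical, and a real system would run over millions of records rather than a single folder of files.

import re
from pathlib import Path

# Hypothetical corpus directory and query terms, for illustration only.
CORPUS_DIR = Path("abstracts")          # one plain-text abstract per file
KEYWORDS = ["apoptosis", "p53", "caspase"]

# One case-insensitive, whole-word pattern per keyword.
patterns = {kw: re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
            for kw in KEYWORDS}

def search_corpus():
    """Return, for each keyword, the files in which it occurs and how often."""
    hits = {kw: {} for kw in KEYWORDS}
    for path in CORPUS_DIR.glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for kw, pat in patterns.items():
            n = len(pat.findall(text))
            if n:
                hits[kw][path.name] = n
    return hits

if __name__ == "__main__":
    for kw, files in search_corpus().items():
        # Show the five files with the most occurrences of each keyword.
        print(kw, "->", sorted(files.items(), key=lambda kv: -kv[1])[:5])

Even over a modest corpus, the output of such a search quickly exceeds what a reader can inspect by hand, which is precisely the difficulty noted above.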
Automated summarizing is available, based on selecting those sentences in which
the most frequent information-containing words occur (as sketched at the end of this
paragraph), but it is generally successful only where the original text is rather simply
constructed. The Holy Grail in the field is the automated inference of semantic
information; hence, progress depends on advances in automated natural language
processing. Equations, drawings, and photographs pose immense problems at present.
Some protagonists even have the ambition to automatically reveal new knowledge in
a text, in the sense of ideas not held by the original writer (e.g., hitherto unperceived
disease–gene associations).
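The sketch below illustrates the frequency-based sentence selection just described: each sentence is scored by the average frequency, over the whole text, of the information-containing words it contains, and the top-scoring sentences are returned in their original order. The stop-word list, the scoring rule, and the naive sentence splitting are illustrative assumptions only; a practical summarizer would use far richer lexical resources.

import re
from collections import Counter

# A small stop-word list standing in for "non-information-containing" words;
# a real system would use a fuller list or inverse document frequencies.
STOPWORDS = {
    "the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are",
    "was", "were", "that", "this", "it", "as", "by", "for", "with", "be",
}

def summarize(text: str, n_sentences: int = 3) -> str:
    """Return the n sentences richest in the text's most frequent
    information-containing words, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence: str) -> float:
        toks = [w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])   # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

As the text notes, selection of this kind works tolerably well on plainly written prose but fails on texts whose meaning depends on discourse structure rather than word frequency.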
It would certainly be of tremendous value if automatic text processing could
achieve something like this level.12 Research papers could be automatically com-
pared with one another, and contradictions highlighted. This would include not only
contradictory facts but also facts contradicting the predictions of hypotheses. High-
lighting the absence of appropriate controls, or inadequate evidence from a statistical
viewpoint, would also be of great value. In principle, all of this is presently done
by individual scientists reading and appraising research papers, even before they are
published, through the peer-review process, which ensures, in principle at least, that
a paper is read carefully by someone other than the author(s) at least once; papers
not meeting acceptable standards should not—again, in principle—be accepted for
publication, but the volume of papers being submitted is now too large for this
method to remain rigorously workable. Another difficulty is the already
immense and still growing breadth of knowledge required to properly review many
papers. One attempt to get over that problem was to start new journals dealing with
small subsets of fields, in the hope that if the boundaries are sufficiently narrowly
delimited, all relevant information can be taken into account. However, this is a hope-
less endeavour: Knowledge is expanding too rapidly and unpredictably for it to be
possible to regulate its dissemination in that way. Hence, it is increasingly likely that
relevant facts are overlooked (and sometimes useful hypotheses too). Furthermore,
the reviewing process is highly fragmented: it is a kind of work that is difficult to
divide among different individuals, and the general trend for the number of scientists
12 Cf. the end of the introductory section in Chap. 27.